Variables selected: Weekly Sales, Temperature, Unemployment Rate, and Fuel Price

We build a dataset that takes the mean across the 45 stores for each variable at each date of the original dataset. The new dataset is called "mean_data".

In [1]:
import pandas as pd

df = pd.read_csv('walmart-sales-dataset-of-45stores.csv')

df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')

df_avg = df.groupby('Date').mean(numeric_only=True)  # average the numeric columns across the 45 stores

df_avg.to_csv('mean_data.csv')

Part 1

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
In [3]:
def _plot_series1(df, col1, col2, series_name, series_index=0):
    fig, ax = plt.subplots(figsize=(10, 5.2), layout='constrained')
    df_sorted = df.sort_values(col1, ascending=True)
    palette = list(sns.palettes.mpl_palette('Dark2'))
    xs = df_sorted[col1]
    ys = df_sorted[col2]
    plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])
    sns.despine(fig=fig, ax=ax)
    plt.xlabel(col1)
    _ = plt.ylabel(col2)
    plt.title(f'{col2} evolution in time')
In [4]:
def _plot_series3(df, col1, col2, col3, title, period='in time'):
    fig, ax1 = plt.subplots(figsize=(10, 6))

    color1 = 'tab:red'
    ax1.set_xlabel(col1)
    ax1.set_ylabel(col2, color=color1)
    ax1.plot(df[col1], df[col2], label=col2, color=color1)
    ax1.tick_params(axis='y', labelcolor=color1)

    ax2 = ax1.twinx()
    color2 = 'tab:blue'
    ax2.set_ylabel(col3, color=color2)
    ax2.plot(df[col1], df[col3], label=col3, color=color2)#, linestyle='--')
    ax2.tick_params(axis='y', labelcolor=color2)

    sns.despine(ax=ax1, right=False)
    sns.despine(ax=ax2, left=True)

    lines1, labels1 = ax1.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    plt.legend(lines1 + lines2, labels1 + labels2, loc='upper left')
    plt.title(f'{col2} and {col3} evolution {period}')
    plt.show()
In [5]:
data = pd.read_csv('mean_data.csv')
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%m-%d')

Visualisations and analysis of the trend, noise, and seasonality of each variable

Weekly Sales variable:

In [6]:
from statsmodels.tsa.seasonal import seasonal_decompose

data.set_index('Date', inplace=True)

# Plot the original time series data
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Weekly_Sales'], label='Original Data')
plt.title('Original Time Series Data')
plt.xlabel('Date')
plt.ylabel('Weekly_Sales')
plt.legend()
plt.show()

rolling_mean = data['Weekly_Sales'].rolling(window=12).mean()

# Plot original data and rolling mean
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Weekly_Sales'], label='Original Data')
plt.plot(data.index, rolling_mean, label='Rolling Mean', color='red')
plt.title('Original Data vs Rolling Mean (Trend)')
plt.xlabel('Date')
plt.ylabel('Weekly_Sales')
plt.legend()
plt.show()

detrended_data = data['Weekly_Sales'] - rolling_mean

decomposition = seasonal_decompose(data['Weekly_Sales'], model='additive', period=12)

# Plot trend, seasonality and noise of the variable
plt.figure(figsize=(10, 8))
plt.subplot(411)
plt.plot(data.index, decomposition.trend, label='Trend')
plt.title('Trend')
plt.xlabel('Date')
plt.ylabel('Weekly_Sales')
plt.legend()

plt.subplot(412)
plt.plot(data.index, decomposition.seasonal, label='Seasonality')
plt.title('Seasonality')
plt.xlabel('Date')
plt.ylabel('Weekly_Sales')
plt.legend()

plt.subplot(413)
plt.plot(data.index, decomposition.resid, label='Residuals')
plt.title('Noise')
plt.xlabel('Date')
plt.ylabel('Weekly_Sales')
plt.legend()

plt.tight_layout()
plt.show()

These visualisations reveal clear trend and seasonality patterns. Notably, weekly sales rise sharply at the end of each year, from October to January, then stabilise for the remainder of the year, giving an annual cycle. Seasonality is also evident from the regular peaks occurring roughly every two months, and the residuals show recurring annual spikes that contribute to the overall variability. The rolling-mean graph reinforces this reading.

Fuel Prices variable:

In [7]:
from statsmodels.tsa.seasonal import seasonal_decompose

# Plot the original time series data
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Fuel_Price'], label='Original Data')
plt.title('Original Time Series Data')
plt.xlabel('Date')
plt.ylabel('Fuel_Price')
plt.legend()
plt.show()

rolling_mean = data['Fuel_Price'].rolling(window=12).mean()

# Plot original data and rolling mean
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Fuel_Price'], label='Original Data')
plt.plot(data.index, rolling_mean, label='Rolling Mean', color='red')
plt.title('Original Data vs Rolling Mean (Trend)')
plt.xlabel('Date')
plt.ylabel('Fuel_Price')
plt.legend()
plt.show()

detrended_data = data['Fuel_Price'] - rolling_mean

decomposition = seasonal_decompose(data['Fuel_Price'], model='additive', period=12)

# Plot trend, noise and seasonality of the variable
plt.figure(figsize=(10, 8))
plt.subplot(411)
plt.plot(data.index, decomposition.trend, label='Trend')
plt.title('Trend')
plt.xlabel('Date')
plt.ylabel('Fuel_Price')
plt.legend()

plt.subplot(412)
plt.plot(data.index, decomposition.seasonal, label='Seasonality')
plt.title('Seasonality')
plt.xlabel('Date')
plt.ylabel('Fuel_Price')
plt.legend()

plt.subplot(413)
plt.plot(data.index, decomposition.resid, label='Residuals')
plt.title('Noise')
plt.xlabel('Date')
plt.ylabel('Fuel_Price')
plt.legend()

plt.tight_layout()
plt.show()

For the Fuel Price variable, there is a noticeable upward trend over the years (from about 2.7 in 2010 to 3.8 in 2012), with peaks recurring each May. The trend graph confirms this upward trajectory. A regular seasonal pattern is also discernible, with cycles recurring every three months, whereas the residuals show no discernible pattern.

Unemployment variable

In [8]:
from statsmodels.tsa.seasonal import seasonal_decompose

# Plot the original time series data
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Unemployment'], label='Original Data')
plt.title('Original Time Series Data')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.legend()
plt.show()

rolling_mean = data['Unemployment'].rolling(window=12).mean()

# Plot original data and rolling mean
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Unemployment'], label='Original Data')
plt.plot(data.index, rolling_mean, label='Rolling Mean', color='red')
plt.title('Original Data vs Rolling Mean (Trend)')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.legend()
plt.show()

detrended_data = data['Unemployment'] - rolling_mean

decomposition = seasonal_decompose(data['Unemployment'], model='additive', period=12)

# Plot trend, noise and seasonality of the variable
plt.figure(figsize=(10, 8))
plt.subplot(411)
plt.plot(data.index, decomposition.trend, label='Trend')
plt.title('Trend')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.legend()

plt.subplot(412)
plt.plot(data.index, decomposition.seasonal, label='Seasonality')
plt.title('Seasonality')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.legend()

plt.subplot(413)
plt.plot(data.index, decomposition.resid, label='Residuals')
plt.title('Noise')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.legend()

plt.tight_layout()
plt.show()

Examining the graph depicting the Unemployment rate over time, we observe a consistent decline from 8.75 in 2010 to 7 in 2012, resembling a staircase-like descent pattern. Both the trend graph and the rolling mean graph reinforce this observation, illustrating a distinct downward trend over time. Moreover, the seasonality graph reveals a recurring pattern every three months, while the noise appears to be irregular.

Temperature variable

In [9]:
from statsmodels.tsa.seasonal import seasonal_decompose

# Plot the original time series data
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Temperature'], label='Original Data')
plt.title('Original Time Series Data')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.legend()
plt.show()

rolling_mean = data['Temperature'].rolling(window=12).mean()

# Plot original data and rolling mean
plt.figure(figsize=(10, 6))
plt.plot(data.index, data['Temperature'], label='Original Data')
plt.plot(data.index, rolling_mean, label='Rolling Mean', color='red')
plt.title('Original Data vs Rolling Mean (Trend)')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.legend()
plt.show()

detrended_data = data['Temperature'] - rolling_mean

decomposition = seasonal_decompose(data['Temperature'], model='additive', period=12)

# Plot noise, trend, seasonality of the variable
plt.figure(figsize=(10, 8))
plt.subplot(411)
plt.plot(data.index, decomposition.trend, label='Trend')
plt.title('Trend')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.legend()

plt.subplot(412)
plt.plot(data.index, decomposition.seasonal, label='Seasonality')
plt.title('Seasonality')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.legend()

plt.subplot(413)
plt.plot(data.index, decomposition.resid, label='Residuals')
plt.title('Noise')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.legend()

plt.tight_layout()
plt.show()

Analyzing our dataset reveals a pronounced annual pattern, with temperatures rising from February to August and declining from August to January, consistent with our expectations. The trend graph illustrates this annual pattern, while the seasonality graph depicts a three-month cycle. However, the noise graph does not display any discernible regular pattern in our data.

Analysis of the influence of each variable on the Weekly Sales

Impact of holidays on the weekly sales

In [10]:
data = pd.read_csv('mean_data.csv')
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%m-%d')
In [11]:
_plot_series3(data, 'Date', 'Weekly_Sales', 'Holiday_Flag', '')

This graph unmistakably illustrates the influence of holidays on our weekly sales, characterized by a substantial increase during holiday periods, aligning with our expectations. Consequently, we can infer that holidays exert a significant impact on weekly sales.
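
A quick way to check this impression numerically is to compare mean sales on holiday and non-holiday weeks. This sketch uses a synthetic stand-in for `mean_data` (the constructed frame and the uplift figure are illustrative only; with the real file, load it with `pd.read_csv('mean_data.csv')` instead):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic stand-in for mean_data: holiday weeks get an extra sales bump
data = pd.DataFrame({
    'Holiday_Flag': rng.integers(0, 2, 143).astype(float),
    'Weekly_Sales': rng.normal(1_050_000, 50_000, 143),
})
data.loc[data['Holiday_Flag'] == 1, 'Weekly_Sales'] += 150_000

# Holiday_Flag in mean_data is a store average, so "any holiday" means flag > 0
means = data.groupby(data['Holiday_Flag'] > 0)['Weekly_Sales'].mean()
uplift = means.loc[True] - means.loc[False]
print(f"Holiday uplift: {uplift:,.0f}")
```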

Impact of the temperature on the weekly sales

In [12]:
_plot_series3(data, 'Date', 'Weekly_Sales', 'Temperature', '')
In [13]:
_plot_series1(data, 'Temperature', 'Weekly_Sales','')

These graphs reveal a notable relationship between Temperature and Weekly Sales: when temperatures exceed roughly 50, weekly sales tend to be lower. The first graph suggests an inverse correlation: as temperatures rise, weekly sales remain low and stable, whereas at lower temperatures sales spike. One plausible explanation is that people in the USA reduce outdoor activities during periods of elevated temperatures.
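
The inverse relationship can be summarised with a single Pearson correlation coefficient. A minimal sketch on synthetic data built to mimic the pattern described above (the constructed series are ours):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic stand-in: sales dip as temperature rises (inverse relationship)
temperature = rng.uniform(30, 90, 143)
weekly_sales = 1_200_000 - 3_000 * temperature + rng.normal(0, 40_000, 143)
df = pd.DataFrame({'Temperature': temperature, 'Weekly_Sales': weekly_sales})

r = df['Temperature'].corr(df['Weekly_Sales'])  # Pearson by default
print(round(r, 2))
```

On the real `mean_data` columns, the sign and magnitude of `r` would make the visual claim precise.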

Impact of the fuel prices on the weekly sales

In [14]:
_plot_series3(data, 'Date', 'Weekly_Sales', 'Fuel_Price', '')
In [15]:
_plot_series1(data, 'Fuel_Price', 'Weekly_Sales','')

While this graph doesn't depict a strong correlation between weekly sales and fuel prices, it does suggest that when fuel prices are at their highest during the year, weekly sales remain low and stable. Conversely, when there's a decrease in fuel prices, we observe a spike in weekly sales.

Impact of the unemployment rate on the weekly sales

In [16]:
_plot_series3(data, 'Date', 'Weekly_Sales', 'Unemployment', '')

Based on this visualization, there doesn't appear to be a correlation between the unemployment rate and weekly sales.

Part 2

In [17]:
from prophet import Prophet
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

Prophet model

Weekly_Sales

In [18]:
df = pd.read_csv('mean_data.csv')

df_avg_sales = df.groupby(['Date'])['Weekly_Sales'].mean().reset_index()
In [19]:
df_avg_sales.rename(columns={'Date': 'ds', 'Weekly_Sales': 'y'}, inplace=True)
In [20]:
df_avg_sales['ds'] = pd.to_datetime(df_avg_sales['ds'])
In [21]:
model = Prophet()
model_fit = model.fit(df_avg_sales)
INFO:prophet:Disabling weekly seasonality. Run prophet with weekly_seasonality=True to override this.
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
DEBUG:cmdstanpy:input tempfile: /tmp/tmpnr05yd_d/ohx9l7mb.json
DEBUG:cmdstanpy:input tempfile: /tmp/tmpnr05yd_d/36srii73.json
DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: None
DEBUG:cmdstanpy:CmdStan args: ['/usr/local/lib/python3.10/dist-packages/prophet/stan_model/prophet_model.bin', 'random', 'seed=336', 'data', 'file=/tmp/tmpnr05yd_d/ohx9l7mb.json', 'init=/tmp/tmpnr05yd_d/36srii73.json', 'output', 'file=/tmp/tmpnr05yd_d/prophet_modeliumalliw/prophet_model-20240509170339.csv', 'method=optimize', 'algorithm=lbfgs', 'iter=10000']
17:03:39 - cmdstanpy - INFO - Chain [1] start processing
INFO:cmdstanpy:Chain [1] start processing
17:03:39 - cmdstanpy - INFO - Chain [1] done processing
INFO:cmdstanpy:Chain [1] done processing
In [22]:
future_dates = model_fit.make_future_dataframe(periods=90)
forecast = model_fit.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
In [23]:
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales['ds'], df_avg_sales['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')

plt.xlabel('Date')
plt.ylabel('Weekly Sales Value')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
In [24]:
model = Prophet()
model.fit(df_avg_sales)
future_dates = model.make_future_dataframe(periods=365*2)
forecast = model.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
In [25]:
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales['ds'], df_avg_sales['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')

plt.xlabel('Date')
plt.ylabel('Weekly Sales Value')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
In [26]:
forecast = model.predict(df_avg_sales)
In [27]:
future = model.make_future_dataframe(periods=365*2)
forecast = model.predict(future)
fig=model.plot(forecast)

fig.gca().set_xlabel("Date")
fig.gca().set_ylabel("Weekly Sales Value")
Out[27]:
Text(87.59722222222221, 0.5, 'Weekly Sales Value')
In [28]:
observed_values = df_avg_sales['y']
predicted_values = forecast['yhat']

predicted_trimmed = predicted_values.iloc[:len(observed_values)]

mae = mean_absolute_error(observed_values, predicted_trimmed)
mse = mean_squared_error(observed_values, predicted_trimmed)
rmse = np.sqrt(mse)

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 45093.7644299397
Mean Squared Error (MSE): 6039856547.081733
Root Mean Squared Error (RMSE): 77716.51399208364

Temperature

In [29]:
df = pd.read_csv('mean_data.csv')

df_avg_sales_Temperature = df.groupby(['Date'])['Temperature'].mean().reset_index()
In [30]:
df_avg_sales_Temperature.rename(columns={'Date': 'ds', 'Temperature': 'y'}, inplace=True)
In [31]:
df_avg_sales_Temperature['ds'] = pd.to_datetime(df_avg_sales_Temperature['ds'])
In [32]:
model = Prophet()
model_fit = model.fit(df_avg_sales_Temperature)
In [33]:
future_dates = model_fit.make_future_dataframe(periods=90)
forecast = model_fit.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
In [34]:
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales_Temperature['ds'], df_avg_sales_Temperature['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')

plt.xlabel('Date')
plt.ylabel('Temperature')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
In [35]:
model = Prophet()
model.fit(df_avg_sales_Temperature)
future_dates = model.make_future_dataframe(periods=365*3)
forecast = model.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
In [36]:
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales_Temperature['ds'], df_avg_sales_Temperature['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')

plt.xlabel('Date')
plt.ylabel('Temperature')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
In [37]:
forecast = model.predict(df_avg_sales_Temperature)
In [38]:
future = model.make_future_dataframe(periods=365*3)
forecast = model.predict(future)
fig=model.plot(forecast)

fig.gca().set_xlabel("Date")
fig.gca().set_ylabel("Temperature")
Out[38]:
Text(92.09722222222221, 0.5, 'Temperature')
In [39]:
observed_values = df_avg_sales_Temperature['y']
predicted_values = forecast['yhat']

predicted_trimmed = predicted_values.iloc[:len(observed_values)]

mae = mean_absolute_error(observed_values, predicted_trimmed)
mse = mean_squared_error(observed_values, predicted_trimmed)
rmse = np.sqrt(mse)

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 1.6586769187551424
Mean Squared Error (MSE): 4.952063967332992
Root Mean Squared Error (RMSE): 2.225323339951521

Unemployment

In [40]:
df = pd.read_csv('mean_data.csv')

df_avg_sales_Unemployment = df.groupby(['Date'])['Unemployment'].mean().reset_index()
In [41]:
df_avg_sales_Unemployment.rename(columns={'Date': 'ds', 'Unemployment': 'y'}, inplace=True)
In [42]:
df_avg_sales_Unemployment['ds'] = pd.to_datetime(df_avg_sales_Unemployment['ds'])
In [43]:
model = Prophet()
model_fit = model.fit(df_avg_sales_Unemployment)
In [44]:
future_dates = model_fit.make_future_dataframe(periods=90)
forecast = model_fit.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
In [45]:
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales_Unemployment['ds'], df_avg_sales_Unemployment['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')

plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
In [46]:
model = Prophet()
model.fit(df_avg_sales_Unemployment)
future_dates = model.make_future_dataframe(periods=365*3)
forecast = model.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
In [47]:
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales_Unemployment['ds'], df_avg_sales_Unemployment['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')

plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
In [48]:
forecast = model.predict(df_avg_sales_Unemployment)
In [49]:
future = model.make_future_dataframe(periods=365*3)
forecast = model.predict(future)
fig=model.plot(forecast)

fig.gca().set_xlabel("Date")
fig.gca().set_ylabel("Unemployment")
Out[49]:
Text(100.84722222222221, 0.5, 'Unemployment')
In [50]:
observed_values = df_avg_sales_Unemployment['y']
predicted_values = forecast['yhat']

predicted_trimmed = predicted_values.iloc[:len(observed_values)]

mae = mean_absolute_error(observed_values, predicted_trimmed)
mse = mean_squared_error(observed_values, predicted_trimmed)
rmse = np.sqrt(mse)

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 0.023509013836708368
Mean Squared Error (MSE): 0.0011406637919996426
Root Mean Squared Error (RMSE): 0.033773714512911404

Fuel_Price

In [51]:
df = pd.read_csv('mean_data.csv')

df_avg_sales_Fuel = df.groupby(['Date'])['Fuel_Price'].mean().reset_index()
In [52]:
df_avg_sales_Fuel.rename(columns={'Date': 'ds', 'Fuel_Price': 'y'}, inplace=True)
In [53]:
df_avg_sales_Fuel['ds'] = pd.to_datetime(df_avg_sales_Fuel['ds'])
In [54]:
model = Prophet()
model_fit = model.fit(df_avg_sales_Fuel)
In [55]:
future_dates = model_fit.make_future_dataframe(periods=90)
forecast = model_fit.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
In [56]:
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales_Fuel['ds'], df_avg_sales_Fuel['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')

plt.xlabel('Date')
plt.ylabel('Fuel Price')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
In [57]:
model = Prophet()
model.fit(df_avg_sales_Fuel)
future_dates = model.make_future_dataframe(periods=365*3)
forecast = model.predict(future_dates)
predicted_trend = forecast['trend']
predicted_seasonality = forecast['yearly']
In [58]:
plt.figure(figsize=(10, 6))
plt.plot(df_avg_sales_Fuel['ds'], df_avg_sales_Fuel['y'], color='blue', label='Observed')
plt.plot(future_dates['ds'], predicted_trend, color='red', label='Trend')
plt.plot(future_dates['ds'], predicted_seasonality, color='green', label='Seasonality')

plt.xlabel('Date')
plt.ylabel('Fuel Price')
plt.title('Prophet Forecast')
plt.legend()
plt.show()
In [59]:
forecast = model.predict(df_avg_sales_Fuel)
In [60]:
future = model.make_future_dataframe(periods=365*3)
forecast = model.predict(future)
fig=model.plot(forecast)

fig.gca().set_xlabel("Date")
fig.gca().set_ylabel("Fuel Price")
Out[60]:
Text(100.84722222222221, 0.5, 'Fuel Price')
In [61]:
observed_values = df_avg_sales_Fuel['y']
predicted_values = forecast['yhat']

predicted_trimmed = predicted_values.iloc[:len(observed_values)]

mae = mean_absolute_error(observed_values, predicted_trimmed)
mse = mean_squared_error(observed_values, predicted_trimmed)
rmse = np.sqrt(mse)

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 0.05872585431994161
Mean Squared Error (MSE): 0.005055406525859416
Root Mean Squared Error (RMSE): 0.07110138202496077

Explanation of our Prophet model:

We opted for the Prophet model because it is a robust tool specifically designed for time series with multiple seasonalities, missing data, and outliers. It offers flexibility in modelling holidays and trend changes, which makes it well suited to retail sales forecasting. Prophet automatically detects changepoints in the trend and integrates them into its forecasts; its parameters include the seasonality mode, holidays, and regularisation settings. The model lets us visualise the data alongside the fitted trend and seasonality, and when used for forecasting it also produces a forecast of the trend and seasonal component of each variable. In our code we forecast three years ahead.

After applying this model to our selected variables, we computed Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) so that we can compare its performance with the SARIMA model.
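
Note that these metrics are computed in-sample, i.e. on the same dates the model was fitted on. For a fairer comparison between models, a chronological holdout split is preferable. The sketch below shows the pattern with a seasonal-naive baseline standing in for a fitted model (the synthetic series and split ratio are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic weekly series with an annual (52-step) cycle
y = 10 + 3 * np.sin(2 * np.pi * np.arange(143) / 52) + rng.normal(0, 0.5, 143)

# Chronological split: last 20% held out, never shuffled
split = int(len(y) * 0.8)
train, test = y[:split], y[split:]

# Seasonal-naive stand-in for a model: repeat the value from 52 steps earlier
pred = y[split - 52: split - 52 + len(test)]

mae = np.mean(np.abs(test - pred))
rmse = np.sqrt(np.mean((test - pred) ** 2))
print(round(mae, 2), round(rmse, 2))
```

The same pattern applies with Prophet or SARIMA: fit on `train`, predict the `test` dates, and score only the held-out portion.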

SARIMA Model

Fuel Price

In [62]:
df = pd.read_csv('mean_data.csv')

df_avg_sales_FP = df.groupby(['Date'])['Fuel_Price'].mean()

# Weekly index covering the observation period (ISO dates avoid day/month ambiguity)
rng = pd.date_range(start='2010-02-05', end='2012-10-31', freq='W')

df_avg_sales_FP = pd.DataFrame(df_avg_sales_FP)
df_avg_sales_FP.set_index(rng, inplace=True)

df2 = df_avg_sales_FP[["Fuel_Price"]].copy()  # .copy() avoids a SettingWithCopyWarning


df2.plot(figsize=(30,10), linewidth=5, fontsize=20)
plt.title("Plot of Fuel_Price", fontsize=30)
plt.xlabel('Date', fontsize=20)
plt.ylabel('Fuel_Price', fontsize=20)
plt.show()
In [63]:
import statsmodels.api as sm

# Non-seasonal order (p, d, q)
ordre_sarima = (1, 0, 0)

# Seasonal order (P, D, Q, s)
ordre_saisonnier_1 = (0, 1, 0, 12)
ordre_saisonnier_2 = (1, 1, 1, 12)
ordre_saisonnier_3 = (0, 1, 1, 12)
ordre_saisonnier_4 = (0, 0, 1, 12)
ordre_saisonnier_5 = (1, 1, 0, 12)
In [64]:
modele_sarima_3 = sm.tsa.SARIMAX(df2, order=ordre_sarima, seasonal_order=ordre_saisonnier_3, exog=None)
modele_sarima_3_fit = modele_sarima_3.fit(disp=False)
print(modele_sarima_3_fit.summary())

modele_sarima_3_fit.plot_diagnostics(lags=20, figsize=(15, 12))
plt.show()
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/sarimax.py:1009: UserWarning: Non-invertible starting seasonal moving average Using zeros as starting parameters.
  warn('Non-invertible starting seasonal moving average'
                                      SARIMAX Results                                       
============================================================================================
Dep. Variable:                           Fuel_Price   No. Observations:                  143
Model:             SARIMAX(1, 0, 0)x(0, 1, [1], 12)   Log Likelihood                 200.211
Date:                              Thu, 09 May 2024   AIC                           -394.423
Time:                                      17:04:07   BIC                           -385.797
Sample:                                  02-07-2010   HQIC                          -390.918
                                       - 10-28-2012                                         
Covariance Type:                                opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.9993      0.318      3.142      0.002       0.376       1.623
ma.S.L12      -0.9982      9.140     -0.109      0.913     -18.913      16.916
sigma2         0.0022      0.020      0.113      0.910      -0.036       0.041
===================================================================================
Ljung-Box (L1) (Q):                  52.88   Jarque-Bera (JB):                 0.30
Prob(Q):                              0.00   Prob(JB):                         0.86
Heteroskedasticity (H):               2.54   Skew:                             0.10
Prob(H) (two-sided):                  0.00   Kurtosis:                         2.88
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [65]:
forecast = modele_sarima_3_fit.get_forecast(steps=143)
forecast_values = forecast.predicted_mean
forecast_conf_int = forecast.conf_int()

plt.figure(figsize=(12, 6))
plt.plot(df2, label='Original Series')
plt.plot(forecast_values.index, forecast_values.values, color='red', label='Forecast')
plt.fill_between(forecast_conf_int.index, forecast_conf_int.iloc[:, 0], forecast_conf_int.iloc[:, 1], color='pink', alpha=0.3, label='Confidence Interval')

plt.xlabel('Date')
plt.ylabel('Fuel Price')
plt.title('Original Series and Forecast')
plt.legend()
plt.grid(True)
plt.show()
In [66]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

mae = mean_absolute_error(df2, forecast_values)
mse = mean_squared_error(df2, forecast_values)
rmse = np.sqrt(mse)

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 0.4633197017480666
Mean Squared Error (MSE): 0.3640109180565085
Root Mean Squared Error (RMSE): 0.6033331733433099

Weekly_Sales

In [67]:
df = pd.read_csv('mean_data.csv')

df_avg_sales_WS = df.groupby(['Date'])['Weekly_Sales'].mean()

rng = pd.date_range(start='02-05-2010', end='10-31-2012', freq='W')

df_avg_sales_WS = pd.DataFrame(df_avg_sales_WS)

df_avg_sales_WS.set_index(rng, inplace=True)

df2 = df_avg_sales_WS[["Weekly_Sales"]]

df2.plot(figsize=(30,10), linewidth=5, fontsize=20)
plt.title("Plot of Weekly_Sales", fontsize=30)
plt.xlabel('Year', fontsize=20)
plt.ylabel('Weekly_Sales', fontsize=20)
plt.show()
In [68]:
ordre_sarima = (1, 0, 0)

# Seasonal order (P, D, Q, s)
ordre_saisonnier_1 = (0, 1, 0, 12)
ordre_saisonnier_2 = (1, 1, 1, 12)
ordre_saisonnier_3 = (0, 1, 1, 12)
ordre_saisonnier_4 = (0, 0, 1, 12)
ordre_saisonnier_5 = (1, 0, 2, 12)
In [69]:
modele_sarima_3 = sm.tsa.SARIMAX(df2, order=ordre_sarima, seasonal_order=ordre_saisonnier_3, exog=None)
modele_sarima_3_fit = modele_sarima_3.fit(disp=False)
print(modele_sarima_3_fit.summary())

modele_sarima_3_fit.plot_diagnostics(lags=20, figsize=(15, 12))
plt.show()
                                      SARIMAX Results                                       
============================================================================================
Dep. Variable:                         Weekly_Sales   No. Observations:                  143
Model:             SARIMAX(1, 0, 0)x(0, 1, [1], 12)   Log Likelihood               -1755.118
Date:                              Thu, 09 May 2024   AIC                           3516.237
Time:                                      17:04:11   BIC                           3524.862
Sample:                                  02-07-2010   HQIC                          3519.742
                                       - 10-28-2012                                         
Covariance Type:                                opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.6550      0.054     12.052      0.000       0.548       0.762
ma.S.L12      -0.6707      0.135     -4.963      0.000      -0.936      -0.406
sigma2       2.98e+10    6.9e-13   4.32e+22      0.000    2.98e+10    2.98e+10
===================================================================================
Ljung-Box (L1) (Q):                   3.33   Jarque-Bera (JB):                84.94
Prob(Q):                              0.07   Prob(JB):                         0.00
Heteroskedasticity (H):               0.25   Skew:                             0.39
Prob(H) (two-sided):                  0.00   Kurtosis:                         6.87
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 2.98e+38. Standard errors may be unstable.
In [70]:
forecast = modele_sarima_3_fit.get_forecast(steps=143)
forecast_values = forecast.predicted_mean
forecast_conf_int = forecast.conf_int()

plt.figure(figsize=(12, 6))
plt.plot(df2, label='Original Series')
plt.plot(forecast_values.index, forecast_values.values, color='red', label='Forecast')
plt.fill_between(forecast_conf_int.index, forecast_conf_int.iloc[:, 0], forecast_conf_int.iloc[:, 1], color='pink', alpha=0.3, label='Confidence Interval')

plt.xlabel('Date')
plt.ylabel('Weekly Sales')
plt.title('Original Series and Forecast')
plt.legend()
plt.grid(True)
plt.show()
In [71]:
mae = mean_absolute_error(df2, forecast_values)
mse = mean_squared_error(df2, forecast_values)
rmse = np.sqrt(mse)

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 70117.34550314884
Mean Squared Error (MSE): 16009144406.13689
Root Mean Squared Error (RMSE): 126527.2476826114

Temperature

In [72]:
df = pd.read_csv('mean_data.csv')

df_avg_sales_T = df.groupby(['Date'])['Temperature'].mean()

rng = pd.date_range(start='02-05-2010', end='10-31-2012', freq='W')

df_avg_sales_T = pd.DataFrame(df_avg_sales_T)

df_avg_sales_T.set_index(rng, inplace=True)

df2 = df_avg_sales_T[["Temperature"]]

df2.plot(figsize=(30,10), linewidth=5, fontsize=20)
plt.title("Plot of Temperatures", fontsize=30)
plt.xlabel('Year', fontsize=20)
plt.ylabel('Temperatures', fontsize=20)
plt.show()
In [73]:
ordre_sarima = (1, 0, 0)

# Seasonal order (P, D, Q, s)
ordre_saisonnier_1 = (0, 1, 0, 12)
ordre_saisonnier_2 = (1, 1, 1, 12)
ordre_saisonnier_3 = (0, 1, 1, 12)
ordre_saisonnier_4 = (0, 0, 1, 12)
ordre_saisonnier_5 = (1, 1, 0, 12)
In [74]:
modele_sarima_3 = sm.tsa.SARIMAX(df2, order=ordre_sarima, seasonal_order=ordre_saisonnier_5, exog=None)
modele_sarima_3_fit = modele_sarima_3.fit(disp=False)
print(modele_sarima_3_fit.summary())

modele_sarima_3_fit.plot_diagnostics(lags=20, figsize=(15, 12))
plt.show()
                                     SARIMAX Results                                      
==========================================================================================
Dep. Variable:                        Temperature   No. Observations:                  143
Model:             SARIMAX(1, 0, 0)x(1, 1, 0, 12)   Log Likelihood                -381.213
Date:                            Thu, 09 May 2024   AIC                            768.426
Time:                                    17:04:15   BIC                            777.051
Sample:                                02-07-2010   HQIC                           771.931
                                     - 10-28-2012                                         
Covariance Type:                              opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.9812      0.020     50.213      0.000       0.943       1.020
ar.S.L12      -0.3216      0.101     -3.196      0.001      -0.519      -0.124
sigma2        19.1196      2.397      7.977      0.000      14.422      23.818
===================================================================================
Ljung-Box (L1) (Q):                   4.22   Jarque-Bera (JB):                 0.50
Prob(Q):                              0.04   Prob(JB):                         0.78
Heteroskedasticity (H):               0.96   Skew:                             0.15
Prob(H) (two-sided):                  0.90   Kurtosis:                         3.06
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [75]:
forecast = modele_sarima_3_fit.get_forecast(steps=143)
forecast_values = forecast.predicted_mean
forecast_conf_int = forecast.conf_int()

plt.figure(figsize=(12, 6))
plt.plot(df2, label='Original Series')
plt.plot(forecast_values.index, forecast_values.values, color='red', label='Forecast')
plt.fill_between(forecast_conf_int.index, forecast_conf_int.iloc[:, 0], forecast_conf_int.iloc[:, 1], color='pink', alpha=0.3, label='Confidence Interval')

plt.xlabel('Date')
plt.ylabel('Temperatures')
plt.title('Original Series and Forecast')
plt.legend()
plt.grid(True)
plt.show()
In [76]:
mae = mean_absolute_error(df2, forecast_values)
mse = mean_squared_error(df2, forecast_values)
rmse = np.sqrt(mse)

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 25.442921157096524
Mean Squared Error (MSE): 885.1674497372217
Root Mean Squared Error (RMSE): 29.751763808843698

Unemployment

In [77]:
df = pd.read_csv('mean_data.csv')

df_avg_sales_U = df.groupby(['Date'])['Unemployment'].mean()

rng = pd.date_range(start='02-05-2010', end='10-31-2012', freq='W')

df_avg_sales_U = pd.DataFrame(df_avg_sales_U)

df_avg_sales_U.set_index(rng, inplace=True)

df2 = df_avg_sales_U[["Unemployment"]]

df2.plot(figsize=(30,10), linewidth=5, fontsize=20)
plt.title("Plot of Unemployment", fontsize=30)
plt.xlabel('Year', fontsize=20)
plt.ylabel('Unemployment', fontsize=20)
plt.show()
In [78]:
ordre_sarima = (1, 0, 0)

# Seasonal order (P, D, Q, s)
ordre_saisonnier_1 = (0, 1, 0, 12)
ordre_saisonnier_2 = (1, 1, 1, 12)
ordre_saisonnier_3 = (0, 1, 1, 12)
ordre_saisonnier_4 = (0, 0, 1, 12)
ordre_saisonnier_5 = (1, 1, 0, 12)
In [79]:
modele_sarima_3 = sm.tsa.SARIMAX(df2, order=ordre_sarima, seasonal_order=ordre_saisonnier_2, exog=None)
modele_sarima_3_fit = modele_sarima_3.fit(disp=False)
print(modele_sarima_3_fit.summary())

modele_sarima_3_fit.plot_diagnostics(lags=20, figsize=(15, 12))
plt.show()
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "
                                      SARIMAX Results                                       
============================================================================================
Dep. Variable:                         Unemployment   No. Observations:                  143
Model:             SARIMAX(1, 0, 0)x(1, 1, [1], 12)   Log Likelihood                 188.641
Date:                              Thu, 09 May 2024   AIC                           -369.281
Time:                                      17:04:24   BIC                           -357.780
Sample:                                  02-07-2010   HQIC                          -364.608
                                       - 10-28-2012                                         
Covariance Type:                                opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.9988      0.011     94.137      0.000       0.978       1.020
ar.S.L12       0.0781      0.166      0.471      0.637      -0.246       0.403
ma.S.L12      -0.8993      0.303     -2.965      0.003      -1.494      -0.305
sigma2         0.0028      0.001      5.624      0.000       0.002       0.004
===================================================================================
Ljung-Box (L1) (Q):                   1.65   Jarque-Bera (JB):              1938.35
Prob(Q):                              0.20   Prob(JB):                         0.00
Heteroskedasticity (H):               4.66   Skew:                            -3.85
Prob(H) (two-sided):                  0.00   Kurtosis:                        20.20
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [80]:
forecast = modele_sarima_3_fit.get_forecast(steps=143)
forecast_values = forecast.predicted_mean
forecast_conf_int = forecast.conf_int()

plt.figure(figsize=(12, 6))
plt.plot(df2, label='Original Series')
plt.plot(forecast_values.index, forecast_values.values, color='red', label='Forecast')
plt.fill_between(forecast_conf_int.index, forecast_conf_int.iloc[:, 0], forecast_conf_int.iloc[:, 1], color='pink', alpha=0.3, label='Confidence Interval')

plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.title('Original Series and Forecast')
plt.legend()
plt.grid(True)
plt.show()
In [81]:
mae = mean_absolute_error(df2, forecast_values)
mse = mean_squared_error(df2, forecast_values)
rmse = np.sqrt(mse)

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
Mean Absolute Error (MAE): 1.7637102946401182
Mean Squared Error (MSE): 3.133634990476137
Root Mean Squared Error (RMSE): 1.7702076122523418

Explanation of our SARIMA model:

To fit the SARIMA model, appropriate orders for the autoregressive, differencing, moving average, and seasonal components needed to be selected. This process involved conducting multiple tests with different parameter values while analyzing autocorrelation plots, normal Q-Q plots, AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) values, and the density histograms obtained from the diagnostics we ran. Additionally, custom code was developed to identify optimal parameter combinations for each variable.

Based on this analysis, the following models were the best fit for each variable: SARIMA (1,0,0)(0,1,1,12) for the Fuel Price and Weekly Sales variables, SARIMA (1,0,0)(1,1,0,12) for the Temperature variable, and SARIMA (1,0,0)(1,1,1,12) for the Unemployment variable. The seasonal period of 12 was chosen based on the yearly patterns identified in the data (Part 1).

Comparison between the models and their forecasts:

Based on our analysis of visualizations, model diagnostics, and metrics, it is evident that the Prophet model outperforms the SARIMA model for each of our variables. The forecast generated by the Prophet model demonstrates higher precision, aligning better with the observed data and the identified trends and seasonality. This conclusion is further supported by the metrics calculated, which consistently show lower Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) values for the Prophet model compared to SARIMA.

For instance, when examining the Unemployment variable, the Prophet model yielded the following metrics: Mean Absolute Error (MAE): 0.023, Mean Squared Error (MSE): 0.001, and Root Mean Squared Error (RMSE): 0.033. In contrast, the SARIMA model produced significantly higher errors: Mean Absolute Error (MAE): 1.763, Mean Squared Error (MSE): 3.133, and Root Mean Squared Error (RMSE): 1.770. This notable disparity strongly favors the Prophet model.
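The Unemployment metrics quoted above can be tabulated for a side-by-side view (the values are copied from the results in this notebook):

```python
import pandas as pd

# Metric values for the Unemployment variable, as reported above.
comparison = pd.DataFrame(
    {
        "Prophet": [0.023, 0.001, 0.033],
        "SARIMA": [1.763, 3.133, 1.770],
    },
    index=["MAE", "MSE", "RMSE"],
)
print(comparison)
```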

Part 3¶

Here we add an inflation variable, exogenous to the dataset, that is expected to have an impact on the unemployment rate and the fuel price in the original dataset.

In [82]:
new_df = pd.read_csv("DATA.csv")
In [83]:
new_df = new_df[['DATE', 'INFLATION(%)']]
new_df.rename(columns={'DATE': 'Date', 'INFLATION(%)': 'Inflation'}, inplace=True)

Selects the parts of the new dataset corresponding to the dates of the original one.

In [84]:
new_df['Date'] = pd.to_datetime(new_df['Date'], format='%d-%m-%Y')
new_df = new_df[(new_df['Date'] >= '2010-02-01') & (new_df['Date'] <= '2012-10-26')]

Visualization of our exogenous variable:

In [85]:
_plot_series1(new_df, 'Date', 'Inflation', '')

Here, we chose to use an outer join to retain the maximum amount of information from both sources. Each source recorded data at different times, so an outer join ensures that we have a complete timeline even where one source has no observation.

In [86]:
df_merged = pd.merge(data, new_df, on='Date', how='outer').sort_values(by='Date')
df_merged
Out[86]:
Date Store Weekly_Sales Holiday_Flag Temperature Fuel_Price CPI Unemployment Inflation
171 2010-02-01 NaN NaN NaN NaN NaN NaN NaN 2.143332
0 2010-02-05 23.0 1.105572e+06 0.0 34.037333 2.717844 167.730885 8.619311 NaN
1 2010-02-12 23.0 1.074148e+06 1.0 34.151333 2.694022 167.825608 8.619311 NaN
2 2010-02-19 23.0 1.072822e+06 0.0 37.719778 2.672067 167.871686 8.619311 NaN
3 2010-02-26 23.0 9.770794e+05 0.0 39.243556 2.683933 167.909657 8.619311 NaN
... ... ... ... ... ... ... ... ... ...
143 2012-10-01 NaN NaN NaN NaN NaN NaN NaN 2.162344
139 2012-10-05 23.0 1.057036e+06 0.0 65.973111 3.845222 176.505052 6.953711 NaN
140 2012-10-12 23.0 1.025078e+06 0.0 58.342667 3.896733 176.636515 6.953711 NaN
141 2012-10-19 23.0 1.002720e+06 0.0 60.705333 3.880000 176.652613 6.953711 NaN
142 2012-10-26 23.0 1.012091e+06 0.0 61.051111 3.791489 176.649482 6.953711 NaN

172 rows × 9 columns

As a result of the outer join, our new dataset contains several NaN values. We used linear interpolation to replace these NaN values; this method provides a reasonable estimate of missing values by following the trend in the data.

In [87]:
df_merged_inter = df_merged.copy()
df_merged_inter.interpolate(method='linear', inplace=True)
_plot_series3(df_merged_inter, 'Date', 'Inflation', 'Unemployment', '', 'between 2010-02 and 2012-10')
In [88]:
df_merged_un = df_merged[(df_merged['Date'] >= '2010-06-01') & (df_merged['Date'] <= '2011-09-01')]
df_merged_un.interpolate(method='linear', inplace=True)
_plot_series3(df_merged_un, 'Date', 'Inflation', 'Unemployment', '', 'between 2010-06 and 2011-09')
<ipython-input-88-82b5fc0d25ef>:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_merged_un.interpolate(method='linear', inplace=True)
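The `SettingWithCopyWarning` above appears because `df_merged_un` is a slice of `df_merged` and is then mutated in place. A common fix, sketched here on a small illustrative frame rather than our actual data, is to take an explicit copy before modifying the slice:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2010-06-06", periods=5, freq="W"),
    "Unemployment": [9.6, 9.5, None, None, 9.4],
})

# Slicing then mutating in place can trigger SettingWithCopyWarning;
# .copy() makes the intent explicit and avoids it.
window = df[df["Date"] >= "2010-06-13"].copy()
window["Unemployment"] = window["Unemployment"].interpolate(method="linear")
```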

Impact of inflation on unemployment:

The relationship between inflation and unemployment is traditionally illustrated by the Phillips curve. This economic model suggests an inverse relationship between the unemployment rate and the inflation rate: when inflation is high, unemployment tends to be lower, and vice versa. We can observe this phenomenon over the whole timeline of our dataset, but especially over the period between 2010-06 and 2011-09.
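This inverse relationship can be quantified with a correlation coefficient. A minimal sketch on synthetic data (on our data, the same call would be `df_merged_un['Inflation'].corr(df_merged_un['Unemployment'])`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
inflation = pd.Series(np.linspace(1.0, 3.5, 60) + rng.normal(0, 0.1, 60))
# Unemployment constructed to fall as inflation rises, as the Phillips curve suggests.
unemployment = pd.Series(9.0 - 0.5 * inflation + rng.normal(0, 0.05, 60))

corr = inflation.corr(unemployment)  # Pearson correlation; negative here
```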

In [89]:
_plot_series3(df_merged_inter, 'Date', 'Inflation', 'Fuel_Price', '', 'between 2010-02 and 2012-10')
In [90]:
df_merged_fp = df_merged[(df_merged['Date'] >= '2010-05-01') & (df_merged['Date'] <= '2012-01-01')]
df_merged_fp.interpolate(method='linear', inplace=True)
_plot_series3(df_merged_fp, 'Date', 'Inflation', 'Fuel_Price', '', 'between 2010-05 and 2012-01')
<ipython-input-90-907b100ba714>:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_merged_fp.interpolate(method='linear', inplace=True)

Impact of inflation on fuel price:

Inflation means that the general price level of goods and services in an economy is rising, so it has a direct impact on fuel prices: as inflation increases, fuel prices tend to rise as well. We can observe this phenomenon over the whole timeline of our dataset, but especially over the period between 2010-05 and 2012-01.

In [91]:
df_merged_fp.dropna(subset=['Inflation', 'Fuel_Price'], inplace=True)
df_merged_un.dropna(subset=['Inflation', 'Unemployment'], inplace=True)
<ipython-input-91-202651bdd47a>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_merged_fp.dropna(subset=['Inflation', 'Fuel_Price'], inplace=True)
<ipython-input-91-202651bdd47a>:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_merged_un.dropna(subset=['Inflation', 'Unemployment'], inplace=True)

Creation of a regression model to better identify the relationship between our exogenous variable (inflation) and the fuel price and unemployment rate from our original data:

In [92]:
import statsmodels.api as sm

def regression_model(df, col1, col2):
    X = sm.add_constant(df[col1])
    model = sm.OLS(df[col2], X).fit()

    print(model.summary())

    df[f'Predicted_{col2}'] = model.predict(X)
    plt.figure(figsize=(10, 6))
    plt.plot(df[col1], df[col2], 'o', label='Data')
    plt.plot(df[col1], df[f'Predicted_{col2}'], 'r--', label='Regression Line')
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.title(f'Relationship between {col1} and {col2}')
    plt.legend()
    plt.show()
In [93]:
regression_model(df_merged_fp, 'Inflation', 'Fuel_Price')
regression_model(df_merged_un, 'Inflation', 'Unemployment')
<ipython-input-92-ffc44b49fc62>:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[f'Predicted_{col2}'] = model.predict(X)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:             Fuel_Price   R-squared:                       0.933
Model:                            OLS   Adj. R-squared:                  0.932
Method:                 Least Squares   F-statistic:                     1416.
Date:                Thu, 09 May 2024   Prob (F-statistic):           1.29e-61
Time:                        17:04:33   Log-Likelihood:                 86.924
No. Observations:                 104   AIC:                            -169.8
Df Residuals:                     102   BIC:                            -164.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.3627      0.026     90.145      0.000       2.311       2.415
Inflation      0.3715      0.010     37.625      0.000       0.352       0.391
==============================================================================
Omnibus:                       15.256   Durbin-Watson:                   0.107
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               16.960
Skew:                           0.908   Prob(JB):                     0.000208
Kurtosis:                       3.786   Cond. No.                         7.52
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           Unemployment   R-squared:                       0.936
Model:                            OLS   Adj. R-squared:                  0.935
Method:                 Least Squares   F-statistic:                     1097.
Date:                Thu, 09 May 2024   Prob (F-statistic):           1.60e-46
Time:                        17:04:33   Log-Likelihood:                 143.37
No. Observations:                  77   AIC:                            -282.7
Df Residuals:                      75   BIC:                            -278.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          8.6170      0.010    853.740      0.000       8.597       8.637
Inflation     -0.1338      0.004    -33.124      0.000      -0.142      -0.126
==============================================================================
Omnibus:                        0.459   Durbin-Watson:                   0.381
Prob(Omnibus):                  0.795   Jarque-Bera (JB):                0.588
Skew:                           0.162   Prob(JB):                        0.745
Kurtosis:                       2.720   Cond. No.                         6.59
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
<ipython-input-92-ffc44b49fc62>:9: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[f'Predicted_{col2}'] = model.predict(X)

This regression supports our idea and shows a clear correlation between inflation and the fuel price (positive correlation), as well as between inflation and the unemployment rate (negative correlation). These results are logical and consistent with our expectations.

Application of a SARIMAX model:

In [94]:
df_merged = pd.merge(data, new_df, on='Date', how='outer').sort_values(by='Date')
df_merged.interpolate(method='linear', inplace=True)
df_merged.dropna(subset=['Inflation', 'Fuel_Price'], inplace=True)
df_merged.dropna(subset=['Inflation', 'Unemployment'], inplace=True)
In [95]:
date_beg = '2010-02-05'
date_end = '2012-10-31'

df_merged['Date'] = pd.to_datetime(df_merged['Date'])

date_range = pd.date_range(start=date_beg, end=date_end, freq='7D')
df_complete = pd.merge(df_merged, pd.DataFrame({'Date': date_range}), on='Date', how='right')

df_complete.sort_values(by='Date', inplace=True)
df_complete.reset_index(drop=True, inplace=True)
In [96]:
rng = pd.date_range(start='02-05-2010', end='10-31-2012', freq='W')

df_complete = pd.DataFrame(df_complete)

df_complete.set_index(rng, inplace=True)
In [97]:
df_selected = df_complete[['Date', 'Inflation', 'Fuel_Price']]
df_selected.set_index(rng, inplace=True)

df_selected = df_selected.dropna()

Using the Fuel Prices variable

In [98]:
order = (1, 0, 0)
seasonal_order = (0, 1, 1, 12)

endog = np.asarray(df_selected['Fuel_Price'])
exog = np.asarray(df_selected[['Inflation']])
model = sm.tsa.SARIMAX(endog, exog=exog, order=order, seasonal_order=seasonal_order)
In [99]:
results = model.fit(disp=False)

print(results.summary())
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/sarimax.py:1009: UserWarning: Non-invertible starting seasonal moving average Using zeros as starting parameters.
  warn('Non-invertible starting seasonal moving average'
                                      SARIMAX Results                                       
============================================================================================
Dep. Variable:                                    y   No. Observations:                  143
Model:             SARIMAX(1, 0, 0)x(0, 1, [1], 12)   Log Likelihood                 208.121
Date:                              Thu, 09 May 2024   AIC                           -408.242
Time:                                      17:05:25   BIC                           -396.742
Sample:                                           0   HQIC                          -403.569
                                              - 143                                         
Covariance Type:                                opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.2025      0.046      4.430      0.000       0.113       0.292
ar.L1          0.9997      0.007    140.910      0.000       0.986       1.014
ma.S.L12      -0.9625      0.443     -2.171      0.030      -1.832      -0.093
sigma2         0.0020      0.001      2.164      0.030       0.000       0.004
===================================================================================
Ljung-Box (L1) (Q):                  46.79   Jarque-Bera (JB):                 0.84
Prob(Q):                              0.00   Prob(JB):                         0.66
Heteroskedasticity (H):               2.63   Skew:                             0.03
Prob(H) (two-sided):                  0.00   Kurtosis:                         2.61
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [100]:
results.plot_diagnostics(figsize=(15, 12))
plt.show()

# Forecasting
forecast_steps = 143
forecast = results.get_forecast(steps=forecast_steps, exog=exog)
forecast_values = forecast.predicted_mean
forecast_conf_int = pd.DataFrame(forecast.conf_int())
forecast_conf_int.set_index(df_selected.index, inplace=True)

forecast_df = pd.DataFrame(forecast_values, index=df_selected.index)

# Plot original and forecast
plt.figure(figsize=(10, 5))
plt.plot(df_selected["Fuel_Price"], label='Original Series')
plt.plot(forecast_df, color='red', label='Forecast')
plt.fill_between(forecast_conf_int.index, forecast_conf_int.iloc[:, 0], forecast_conf_int.iloc[:, 1], color='pink', alpha=0.3, label='Confidence Interval')
plt.title('Fuel Price Forecast')
plt.xlabel('Date')
plt.ylabel('Fuel Price')
plt.legend()
plt.show()


# Evaluate forecast accuracy
actuals = df_selected['Fuel_Price'][-forecast_steps:]
mae = mean_absolute_error(actuals, forecast_values)
mse = mean_squared_error(actuals, forecast_values)
rmse = np.sqrt(mse)

print("Forecast Accuracy Measures:")
print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
Forecast Accuracy Measures:
MAE: 0.8458238368744987
MSE: 0.7542514868360004
RMSE: 0.8684765321158657

Based on this analysis, the model fits the data reasonably well once the exogenous variable is included, although the significant Ljung-Box statistic (Prob(Q) = 0.00) suggests some autocorrelation remains in the residuals. The forecast accuracy measures (MAE, MSE and RMSE) quantify the prediction error: despite some level of error, the model shows a reasonable ability to predict values from the given data and exogenous variable.
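The three accuracy measures follow directly from their definitions. A minimal self-contained sketch (with made-up numbers rather than the notebook's series):

```python
import numpy as np

def forecast_errors(actuals, predictions):
    """Compute MAE, MSE and RMSE directly from their definitions."""
    actuals = np.asarray(actuals, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    errors = actuals - predictions
    mae = np.mean(np.abs(errors))   # mean absolute error
    mse = np.mean(errors ** 2)      # mean squared error
    rmse = np.sqrt(mse)             # root mean squared error
    return mae, mse, rmse

# Toy example: three actual vs. predicted values
mae, mse, rmse = forecast_errors([2.5, 3.0, 3.5], [2.4, 3.2, 3.3])
print(mae, mse, rmse)
```

These match `sklearn.metrics.mean_absolute_error` and `mean_squared_error` as used in the cells above; the helper name `forecast_errors` is ours, for illustration only.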

Using the Unemployment variable

In [101]:
df_selected2 = df_complete[['Date', 'Inflation', 'Unemployment']]
df_selected2.set_index(rng, inplace=True)

df_selected2 = df_selected2.dropna()
In [102]:
order = (1, 0, 0)
seasonal_order = (1, 1, 1, 12)

endog2 = np.asarray(df_selected2['Unemployment'])
exog2 = np.asarray(df_selected2[['Inflation']])
model = sm.tsa.SARIMAX(endog2, exog=exog2, order=order, seasonal_order=seasonal_order)
In [103]:
results = model.fit(disp=False)
print(results.summary())
                                      SARIMAX Results                                       
============================================================================================
Dep. Variable:                                    y   No. Observations:                  143
Model:             SARIMAX(1, 0, 0)x(1, 1, [1], 12)   Log Likelihood                 188.709
Date:                              Thu, 09 May 2024   AIC                           -367.418
Time:                                      17:05:56   BIC                           -353.042
Sample:                                           0   HQIC                          -361.577
                                              - 143                                         
Covariance Type:                                opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.0171      0.072      0.238      0.812      -0.124       0.158
ar.L1          0.9992      0.008    126.362      0.000       0.984       1.015
ar.S.L12       0.0892      0.165      0.542      0.588      -0.233       0.412
ma.S.L12      -0.9160      0.346     -2.649      0.008      -1.594      -0.238
sigma2         0.0028      0.001      4.647      0.000       0.002       0.004
===================================================================================
Ljung-Box (L1) (Q):                   1.71   Jarque-Bera (JB):              1927.60
Prob(Q):                              0.19   Prob(JB):                         0.00
Heteroskedasticity (H):               4.62   Skew:                            -3.85
Prob(H) (two-sided):                  0.00   Kurtosis:                        20.14
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [104]:
results.plot_diagnostics(figsize=(15, 12))
plt.show()

# Forecasting
forecast_steps = 143
forecast = results.get_forecast(steps=forecast_steps, exog=exog2)
forecast_values = forecast.predicted_mean
forecast_conf_int = pd.DataFrame(forecast.conf_int())
forecast_conf_int.set_index(df_selected2.index, inplace=True)


forecast_df = pd.DataFrame(forecast_values, index=df_selected2.index)

# Plot original and forecast
plt.figure(figsize=(10, 5))
plt.plot(df_selected2["Unemployment"], label='Original Series')
plt.plot(forecast_df, color='red', label='Forecast')
plt.fill_between(forecast_conf_int.index, forecast_conf_int.iloc[:, 0], forecast_conf_int.iloc[:, 1], color='pink', alpha=0.3, label='Confidence Interval')
plt.title('Unemployment Forecast')
plt.xlabel('Date')
plt.ylabel('Unemployment')
plt.legend()
plt.show()


# Evaluate forecast accuracy
actuals = df_selected2['Unemployment'][-forecast_steps:]
mae = mean_absolute_error(actuals, forecast_values)
mse = mean_squared_error(actuals, forecast_values)
rmse = np.sqrt(mse)

print("Forecast Accuracy Measures:")
print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse}")
Forecast Accuracy Measures:
MAE: 1.7840589351319822
MSE: 3.202114894278942
RMSE: 1.7894454152834454

In the same way, this model fits the data and yields usable predictions, although its accuracy metrics are higher (i.e., worse) than those obtained with the Fuel_Price variable, and the diagnostics (Prob(JB) = 0.00, strong skew and excess kurtosis) indicate clearly non-normal residuals.
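To make the comparison between the two exogenous-variable models explicit, the reported metrics can be tabulated side by side (values copied from the printed outputs above):

```python
import pandas as pd

# Accuracy metrics reported above for each exogenous-variable model
comparison = pd.DataFrame(
    {"Fuel_Price": [0.8458, 0.7543, 0.8685],
     "Unemployment": [1.7841, 3.2021, 1.7894]},
    index=["MAE", "MSE", "RMSE"],
)
print(comparison)
print("Lower-error model per metric:")
print(comparison.idxmin(axis=1))
```

On every metric the Fuel_Price model has the lower error, consistent with the conclusion above.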